Data mining in biosciences
M2 DMBS
2024-2025
Marie-Joe Karam Kassar
In this project, you will extract Open Reading Frames (ORFs) from a partial human transcriptome assembly provided in the Homo_sapiens_cdna_assembled.fasta file. You will set up a Docker container to conduct the analysis in a controlled environment.
Before proceeding, ensure you have:
Homo_sapiens_cdna_assembled.fasta).
{ORF_ID <str>: (sequence <str>, length <int>, source_contig <str>, (start <int>, end <int>), reading_frame <int>)
}
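The dictionary layout above can be illustrated with a minimal sketch. It is not a reference solution: it assumes an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG or TGA), scans only the three forward frames, uses 0-based start and end-exclusive coordinates, and numbers frames 1 to 3. The names find_orfs and min_codons and the ORF_ID pattern are hypothetical choices made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(contig_id, seq, min_codons=2):
    """Return {orf_id: (sequence, length, source_contig, (start, end), reading_frame)}.

    Assumptions (not from the assignment): forward strand only,
    ORF = ATG ... first in-frame stop, 0-based start, end-exclusive end.
    """
    seq = seq.upper()
    orfs = {}
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # include the stop codon in the ORF
                if (end - start) // 3 >= min_codons:
                    orf_id = f"{contig_id}_f{frame + 1}_{start}"  # hypothetical ID scheme
                    orfs[orf_id] = (seq[start:end], end - start, contig_id,
                                    (start, end), frame + 1)
                start = None
    return orfs


# Toy sequence: a single ORF (ATG AAA TAG) in the third forward frame
example = find_orfs("contig1", "CCATGAAATAGGG")
print(example)
```

A complete analysis would probably also scan the reverse complement and decide how to treat ORFs that reach the end of a contig without a stop codon.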
#! /bin/bash
wget https://ftp.ncbi.nlm.nih.gov/blast/db/v5/swissprot.tar.gz
tar -xvf swissprot.tar.gz
Use uniprot/swissprot to identify coding sequences (CDS) and UTR regions by running BLAST. You can run BLAST from Python using the subprocess module, or otherwise from bash.
#! /bin/bash
# 9606 is the taxid for human
blast? -db /path/to/swissprot -query /path/to/multifastafile -taxids 9606 -outfmt 7 -out /path/to/output/file.tsv
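As a rough sketch of the Python route, the same command line can be assembled as an argument list and passed to subprocess.run. The function names build_blast_command and run_blast are hypothetical, and the BLAST program is left as a parameter because choosing it is part of the exercise; the options mirror the bash command above.

```python
import subprocess


def build_blast_command(program, db, query, out, taxid=9606):
    """Assemble the BLAST+ command line as a list of arguments.

    `program` is the BLAST+ executable to call; it is deliberately not fixed here.
    """
    return [
        program,
        "-db", str(db),
        "-query", str(query),
        "-taxids", str(taxid),  # 9606 is the taxid for human
        "-outfmt", "7",         # tabular output with comment lines
        "-out", str(out),
    ]


def run_blast(program, db, query, out, taxid=9606):
    """Run BLAST; raises subprocess.CalledProcessError on a non-zero exit status."""
    cmd = build_blast_command(program, db, query, out, taxid)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces, and check=True makes a failed BLAST run stop the script instead of silently producing an empty output file.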
The length of an ORF can be modeled using a geometric distribution, where each codon represents a trial, and encountering a stop codon is considered a "success". Assuming that each of the 64 possible codons in the genetic code is equally likely, p is the probability of encountering a stop codon in a single trial. Using the geometric distribution, determine the expected number of codons (trials) until a stop codon is encountered. Recall that for a geometric distribution, the expected number of trials until the first success is E[X] = 1/p.
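The calculation can be checked numerically with a short sketch. It assumes the standard genetic code, in which 3 of the 64 codons (TAA, TAG, TGA) are stop codons, so p = 3/64; the simulate_mean helper is a hypothetical addition that estimates the same expectation by drawing random trials.

```python
import random
from fractions import Fraction

STOP_CODONS = 3    # TAA, TAG, TGA in the standard genetic code (assumption)
TOTAL_CODONS = 64

p = Fraction(STOP_CODONS, TOTAL_CODONS)  # probability of a stop codon per trial
expected_codons = 1 / p                  # E[X] = 1/p for a geometric distribution


def simulate_mean(n_runs=100_000, seed=0):
    """Estimate the mean number of codons drawn until the first stop codon."""
    rng = random.Random(seed)
    p_stop = float(p)
    total = 0
    for _ in range(n_runs):
        trials = 1
        while rng.random() >= p_stop:  # not a stop codon, draw another one
            trials += 1
        total += trials
    return total / n_runs


print(expected_codons, float(expected_codons))
print(simulate_mean())
```

Under these assumptions the exact value is 64/3, a little over 21 codons, and the simulated mean should land close to it.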
Your analysis must be run within a Docker container.
Use python:3.12-slim as the base image.
Install blast+ (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz).
Include the uniprot/swissprot database for human proteins.
Dockerfile:
FROM python:3.12-slim
# System packages needed to download and run BLAST+
RUN apt-get update && \
    apt-get install -y wget libgomp1
WORKDIR /app
# Download the BLAST+ binaries and install them under /usr/local
RUN wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz && \
    tar -xzvf ncbi-blast-2.16.0+-x64-linux.tar.gz && \
    mv ncbi-blast-2.16.0+ /usr/local/ && \
    rm ncbi-blast-2.16.0+-x64-linux.tar.gz
ENV PATH="/usr/local/ncbi-blast-2.16.0+/bin:$PATH"
# Download the SwissProt BLAST database and unpack it into /db/swissprot
RUN wget https://ftp.ncbi.nlm.nih.gov/blast/db/v5/swissprot.tar.gz && \
    mkdir -p /db/swissprot && mv swissprot* /db/swissprot && \
    tar -xzvf /db/swissprot/swissprot.tar.gz -C /db/swissprot
COPY . /app
blastp -db /db/swissprot/swissprot -query /analysis/test.fasta -taxids 9606 -outfmt 7 -out test.out
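Output written with -outfmt 7 is tab-separated, with comment lines starting with "#". The sketch below shows one way to read it into dictionaries, assuming the default twelve tabular columns (qseqid through bitscore); the function name parse_blast_tabular is hypothetical, and the sample line is invented purely for illustration, not real BLAST output.

```python
# Default column order of BLAST+ tabular output (-outfmt 6 and 7)
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_blast_tabular(lines):
    """Turn tabular BLAST lines into a list of dicts, skipping '#' comment lines."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(BLAST_COLUMNS):
            continue  # not a default-format hit line
        hit = dict(zip(BLAST_COLUMNS, fields))
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


# Invented example input, for illustration only
sample = [
    "# Query: orf_1",
    "orf_1\tsubject_1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t400",
]
hits = parse_blast_tabular(sample)
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["evalue"])
```

For a real run, the same function can be given an open file handle, for example `with open("test.out") as fh: hits = parse_blast_tabular(fh)`.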
To build your Docker image and test it by running a container interactively, follow these steps:
Create the folder dockerfiles/blast/v2.16.0.
Create a file named dockerfile within that folder and paste the above Dockerfile content.
From the docker/images/blast/v2.16.0 folder run:
docker build -t blast:v2.16.0 .
Then go to the working directory containing your script and input file, and run a container as follows:
docker run -it --rm -v /path/to/your/workingdire:/analysis blast:v2.16.0 bash
Your final submission should be on GitHub and include at least these files:
orf_detection.py script.